require( tidyverse )
## Loading required package: tidyverse
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tibble' was built under R version 3.4.1
## Warning: package 'tidyr' was built under R version 3.4.1
## Warning: package 'purrr' was built under R version 3.4.1
## Warning: package 'dplyr' was built under R version 3.4.1
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
require( data.table )
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
require( maps )
## Loading required package: maps
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
require( zipcode )
## Loading required package: zipcode
require( lme4 )
## Loading required package: lme4
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.1
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
This dataset covers the period of 2000–2017 and includes 62560 records of fines levied against corporations relating to violations of regulations, from cases initiated by 43 federal regulatory agencies. I downloaded the data from the Good Jobs First “Violation Tracker”, using the search GUI with all the options set to <any>.
viol <- tbl_df( fread( "/Users/willpitchers/Documents/=Job_Applications_etc/Data_Incubator_2017/violation_tracker_export.csv" ))
names( viol ) <- gsub( " ", "_", names( viol ))
viol <- viol %>% mutate( Year=as.integer( Year ), Industry_code=factor( Industry_in_Record ), Civ_Crim=factor( `Civil/Criminal` ) ) %>%
mutate( HQ_State_of_Parent=factor( HQ_State_of_Parent ), HQ_Country_of_Parent=factor( HQ_Country_of_Parent ) ) %>%
mutate( Primary_Offense=factor( Primary_Offense ), Penalty_Amount=as.numeric( gsub( "[$,]", "", Penalty_Amount ) ) ) %>%
mutate( Penalty_Adj=as.numeric( gsub( "[$,]", "", Penalty_Amount_Adjusted_For_Eliminating_Multiple_Counting )) ) %>%
mutate( Subtraction_From_Penalty=as.numeric( "[$,]", "", Subtraction_From_Penalty ) ) %>%
mutate( Agency=factor( Agency ), Secondary_Offense=factor( Secondary_Offense ), Ownership_Structure=factor( Ownership_Structure ) ) %>%
mutate( Major_Industry_of_Parent=factor( Major_Industry_of_Parent ), Zip=factor( Zip ), Facility_State=factor( Facility_State ) )
## Warning in evalq(as.numeric("[$,]", "", Subtraction_From_Penalty),
## <environment>): NAs introduced by coercion
viol <- viol %>% mutate( Civ_Crim_bin=factor( ifelse( grepl( "civil and criminal", `Civil/Criminal` )=="TRUE", "both",
ifelse( grepl( "civil", `Civil/Criminal` )=="TRUE", "civil", "criminal" ))))
# str( viol )
# summary( viol$Year )
viol
This dataset contains fields for both Penalty_Amount and Penalty_Amount_Adjusted_For_Eliminating_Multiple_Counting, but is clear that these adjustments make very little difference to the data as a whole, as these two variables are correlation at r=0.994. I have thus elected to use the adjusted penalty values for all these analyses.
The first pattern to note is that (perhaps predictably) the penalties imposed via criminal proceedings tend to be larger than those imposed via civil proceedings, and larger still when both civil and criminal proceedings have been brought.
viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( Penalty_Adj )) +
geom_density( aes( col=Civ_Crim_bin, fill=Civ_Crim_bin ), alpha=.5 ) +
scale_x_log10() +
xlab( "Penalty (log10 $'s)" ) +
scale_fill_discrete( name="Type of\ncase brought") +
scale_colour_discrete( name="Type of\ncase brought" )
The way this pattern is built up is somewhat non-intuitive, as the linear model and boxplot below make clear. The groups mean are well-separated, and easy to distinguish statistically (small p-values), but there is so much variation within groups that the predictive value of the mean differences is small (low R2 value). The vast majority of penalties arise from civil actions – 99.3% of those recorded – and the mean value of these penalties is comparatively small.
summary( lm( Penalty_Adj ~ Civ_Crim_bin, data=viol ) )
##
## Call:
## lm(formula = Penalty_Adj ~ Civ_Crim_bin, data = viol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.913e+08 -4.644e+06 -4.640e+06 -4.627e+06 2.080e+10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 596240136 24921654 23.93 <2e-16 ***
## Civ_Crim_bincivil -591589764 24931682 -23.73 <2e-16 ***
## Civ_Crim_bincriminal -464209918 26451988 -17.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 176200000 on 62557 degrees of freedom
## Multiple R-squared: 0.01211, Adjusted R-squared: 0.01207
## F-statistic: 383.3 on 2 and 62557 DF, p-value: < 2.2e-16
viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( Civ_Crim_bin, Penalty_Adj )) +
geom_point( colour="blue", alpha=0.3, position="jitter" ) +
geom_boxplot( outlier.size=0, alpha=0 ) +
coord_flip() +
xlab( "Type of case brought" ) +
scale_y_log10() +
ylab( "Penalty (log10 $'s)" )
Modelling the relationship between penalty values and the year the penalty was imposed reveals an increasing trend, with the coefficent for year indicating an increase of ~$950k per year. However, the R2 value for this model indicates that it accounts for only 0.05% of the variation in the data.
This makes sense with reference to the scatter plot, where we can see that the trend for annual increase (represented by the dashed black line) is pretty modest when compared to huge variation in the size of penalties (note log10 scale).
summary( lm( Penalty_Adj ~ Year, data=viol ) )
##
## Call:
## lm(formula = Penalty_Adj ~ Year, data = viol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.323e+07 -9.358e+06 -6.516e+06 -2.701e+06 2.079e+10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.915e+09 3.175e+08 -6.032 1.63e-09 ***
## Year 9.562e+05 1.580e+05 6.051 1.45e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 177200000 on 62558 degrees of freedom
## Multiple R-squared: 0.0005849, Adjusted R-squared: 0.0005689
## F-statistic: 36.61 on 1 and 62558 DF, p-value: 1.449e-09
viol %>% filter( Penalty_Adj > 0 ) %>% ggplot( aes( x=Year, y=Penalty_Adj ) ) +
geom_jitter( aes( col=Civ_Crim_bin ), alpha=0.5, width=.2, height=0 ) +
ylab( "Penalty amount (log10 $'s)" ) +
geom_smooth( method="lm", col="black", lty=2 ) +
scale_colour_discrete( name="Type of\ncase brought" ) +
scale_y_log10()
Taking a broad-strokes view of the geographical data, we can see that there seems not to be a visually apparent trend in the locations where penalties are levied – this data appears to track pretty well with the locations of population centres, thought here may be more subtle patterns that are not visible at the nation-wide scale.
data( zipcode )
zipcode <- zipcode %>% mutate( zip=factor( zip ), region=substr( zip, 1, 1) )
full_join( viol, zipcode, by=c( "Zip" = "zip" ) ) %>% mutate( Zip=factor( Zip ) ) %>% filter( Civ_Crim_bin=="civil" ) %>%
ggplot() + geom_point( aes( x=longitude, y=latitude, col=Year ), cex=.5 ) +
theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NULL ) +
scale_y_continuous(limits = c(25,50), breaks = NULL ) +
labs(x=NULL, y=NULL)
However, if we look the number of penalties paid in each state/territory over the course of this 18-yr dataset, we can see that there are many more violations in some states that others, with West Virginia being responsible for 15.22% of all penalties levied.
viol %>% filter( Facility_State != "" ) %>% ggplot( aes( reorder( Facility_State, Facility_State, function(x)-length(x) ) )) +
geom_bar( aes( fill=Major_Industry_of_Parent )) +
coord_flip() +
theme( axis.text.y=element_text( hjust=0, size=9 ) ) +
ylab( "no. penalties" ) +
xlab( "" ) +
scale_fill_discrete( name="Industrial Sector" )
The bars are colored by the industrial sector of the parent corporation found liable for the penalty – the blue that occupies 93.43% of the West Virginia bar represents corporations classified as “mining and minerals”. It is apparent that WV is unusual in both the number of violations and the number of those violations related to mining.
Across the 49 industrial sectors represented, it is immediately clear from this barplot the extent to which mining & mineral corporations are over-represented (25.24% of all penalties). It is also clear that the pink areas – indicating ‘workplace safety or health’ violations – comprise the majority of violations in most sectors, but particularly so in mining.
viol %>% filter( Facility_State != "" ) %>% ggplot( aes( reorder( Major_Industry_of_Parent, Major_Industry_of_Parent, function(x)-length(x) ) )) +
geom_bar( aes( fill=Primary_Offense )) +
coord_flip() +
theme( axis.text.y=element_text( hjust=0, size=9 ) ) +
ylab( "no. penalties" ) +
xlab( "" ) +
scale_fill_discrete( name="Violation Categories" )
From these analyses so far, my preliminary recommendation would be that the primary focus of labor-safety lobbying efforts ought to be the West Virginian mining industry. This would also be a potentially fruitful focus for legislators and regulators, as the apparent abundance of regulatory violations would potentially allow for high statistical power in any empirical tests for the effects of regulatory/enforcement policy changes.
The over-representation of workplace health and safety violations is also noteworthy, and not limited to the mining sector, nor to West Virginia. Given the generality of this pattern, a potential ethical investor should be encouraged to investigate a corporation’s history of health & safety compliance as a priority.